
[Figure 4.10: optimization of the real-valued architecture versus optimization of the binary architecture, contrasting optimized and sub-optimized solutions.]

FIGURE 4.10
Motivation for DCP-NAS. We first show that directly binarizing the real-valued architecture to 1-bit is sub-optimal. Thus we use tangent propagation (middle) to find an optimized 1-bit neural architecture along the tangent direction, leading to a better-performing 1-bit neural architecture.

4.4 DCP-NAS: Discrepant Child-Parent Neural Architecture Search for 1-Bit CNNs

Based on the CP-NAS introduced above, and given that real-valued models converge much faster than 1-bit models, as revealed in [157], we are motivated to use the tangent direction of the Parent supernet (real-valued model) as an indicator of the optimization direction for the Child supernet (1-bit model). We assume that all possible 1-bit neural architectures can be learned from the tangent space of the Parent model, based on which we introduce a Discrepant Child-Parent Neural Architecture Search (DCP-NAS) [135] method to produce an optimized 1-bit CNN. Specifically, as shown in Fig. 4.10, we use the Parent model to find a tangent direction and learn the 1-bit Child through tangent propagation, rather than directly binarizing the Parent to obtain the Child. Since the tangent direction is based on second-order information, we further accelerate the search process with the Generalized Gauss-Newton (GGN) matrix, leading to an efficient search. Furthermore, a coupling relationship exists between the weights and the architecture parameters in such DARTS-based [151] methods, leading to asynchronous convergence and insufficient training. To overcome this obstacle, we propose a decoupled optimization for training the Child-Parent model, leading to an effective and optimized search process. The overall framework of our DCP-NAS is shown in Fig. 4.11.
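To make the two ingredients named above concrete, the following is a minimal PyTorch sketch of (i) the tangent (Jacobian) of the Parent's output with respect to the architecture parameters and (ii) the GGN approximation of the second-order information. It is illustrative only; the toy output function f, the parameter shapes, and the squared-loss Hessian are assumptions for the sketch, not the book's implementation.

```python
import torch

# Illustrative sketch only (not the book's code): the Parent's Jacobian w.r.t.
# the architecture parameters defines a tangent direction, and the Generalized
# Gauss-Newton (GGN) matrix J^T H_L J replaces the exact Hessian of the loss.

alpha = torch.randn(6)                               # toy architecture parameters of the Parent

def f(a):                                            # stand-in for the Parent supernet's output, R^6 -> R^3
    return torch.stack([a[:2].sum(), (a[2:4] ** 2).sum(), torch.tanh(a[4:]).sum()])

J = torch.autograd.functional.jacobian(f, alpha)     # 3 x 6 Jacobian: the tangent of f at alpha

# First-order (tangent) prediction of the output for a binarized candidate alpha_hat
alpha_hat = torch.sign(alpha)                        # naive binarization, for illustration only
f_tangent = f(alpha) + J @ (alpha_hat - alpha)

# GGN curvature approximation: J^T H_L J, with H_L the Hessian of the loss w.r.t. the output
H_L = 2.0 * torch.eye(3)                             # e.g., the Hessian of a squared loss ||f - y||^2
GGN = J.T @ H_L @ J                                  # 6 x 6, cheaper than the exact Hessian w.r.t. alpha
```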

4.4.1 Preliminary

Neural architecture search. Given a conventional CNN model, we denote $w \in \mathcal{W}$, with $\mathcal{W} = \mathbb{R}^{C_{out} \times C_{in} \times K \times K}$, and $a_{in} \in \mathbb{R}^{C_{in} \times W \times H}$ as its weights and feature maps in a specific layer, where $C_{out}$ and $C_{in}$ represent the output and input channels of that layer, $(W, H)$ are the width and height of the feature maps, and $K$ is the kernel size. Then we have

$$
a_{out} = a_{in} \otimes w,
\qquad (4.19)
$$
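As a quick illustration of Eq. (4.19), the snippet below builds the feature maps $a_{in}$ and the real-valued weights $w$ with the shapes defined above and computes $a_{out}$ with a standard convolution. The concrete shape values, the batch dimension, and the padding choice are illustrative assumptions, not part of the book's code.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes following the notation above (values are assumptions)
C_in, C_out, K, W, H = 16, 32, 3, 28, 28

a_in = torch.randn(1, C_in, H, W)        # input feature maps a_in in R^{C_in x W x H} (batch of 1)
w = torch.randn(C_out, C_in, K, K)       # real-valued weights w in R^{C_out x C_in x K x K}

# Eq. (4.19): a_out obtained by convolving a_in with w
a_out = F.conv2d(a_in, w, padding=K // 2)
print(a_out.shape)                        # torch.Size([1, 32, 28, 28])
```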